Below, I explore a tidy dataset of white wines to find out which chemical properties most influence the sensory quality of taste. I'll also look at which chemical properties are closely related to one another. The quality variable is based on taste-testing scores between 0 and 10 (worst to best); all the other variables are based on physicochemical tests.
I want to start by looking at the structure of the dataset.
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
I am going to remove column “X”, as this column is an identifier, and I don’t want these numbers included in any statistics. After that, there will be 11 input variables and 1 output variable (quality), each with 4898 observations.
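The code for this step is not shown, so the following is only a minimal sketch of one way to drop the identifier column in base R. The data frame names `wine` and `wine_sub` are hypothetical, and the toy rows reuse the first four `alcohol` and `quality` values from the structure output above.

```r
# Toy stand-in for the wine data: first four alcohol/quality values
# from the str() preview above, plus the identifier column X.
wine <- data.frame(
  X       = 1:4,
  alcohol = c(8.8, 9.5, 10.1, 9.9),
  quality = c(6L, 6L, 6L, 6L)
)

# Drop the identifier by name so it cannot leak into any statistics.
# `wine_sub` is a hypothetical name for the working subset.
wine_sub <- wine[, setdiff(names(wine), "X"), drop = FALSE]

str(wine_sub)
```

Selecting by name rather than by position keeps the step valid even if the column order changes.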
For each variable, I will start by plotting the original data in a histogram and a boxplot to view the distribution and outliers. This will be followed by a histogram and boxplot of the current subset. I will then remove outliers from the variable in the subset, and plot again.
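As an illustration of the histogram-plus-boxplot view described above, here is a small base R sketch. The plotting library used for the original figures is not shown, so base graphics are an assumption; the values are the ten `residual.sugar` entries visible in the structure output.

```r
# Ten residual.sugar values from the str() preview above.
residual_sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# Draw the two views side by side, then restore the previous layout.
old_par <- par(mfrow = c(1, 2))
hist(residual_sugar, main = "Histogram", xlab = "residual.sugar")
boxplot(residual_sugar, horizontal = TRUE,
        main = "Boxplot", xlab = "residual.sugar")
par(old_par)

# The points a boxplot draws as outliers can also be listed directly.
flagged <- boxplot.stats(residual_sugar)$out
flagged
```

Note that `boxplot.stats()` computes its fences from the hinges of `fivenum()`, which can differ slightly from the quartiles returned by `quantile()`, so a boxplot and a quantile-based filter may occasionally disagree on borderline points.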
To remove outliers, I will use the 1.5 x IQR rule: any value more than 1.5 times the interquartile range (IQR) below the lower quartile or above the upper quartile is treated as an outlier.
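The outlier rule can be sketched as a small helper. This is an assumption about how the filter might be written, not the code actually used: `remove_iqr_outliers` is a hypothetical name, the quartiles come from `quantile()` with its default settings, and the toy `quality` values are made up for illustration.

```r
# Keep only rows whose value in `col` lies inside
# [Q1 - k * IQR, Q3 + k * IQR]; k = 1.5 is the usual choice.
# (Hypothetical helper; the original code is not shown.)
remove_iqr_outliers <- function(df, col, k = 1.5) {
  x <- df[[col]]
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  keep <- !is.na(x) & x >= q[1] - k * iqr & x <= q[2] + k * iqr
  df[keep, , drop = FALSE]
}

# Made-up scores with one very low and one very high value.
toy <- data.frame(quality = c(3, 5, 5, 6, 6, 6, 6, 6, 7, 9))

trimmed <- remove_iqr_outliers(toy, "quality")
dim(trimmed)            # rows and columns left after filtering
range(trimmed$quality)  # remaining score range
```

Calling `dim()` on the result gives the same kind of row and column count that is printed after each removal step in this report.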
original:
The original quality distribution is mostly normal with a few outliers in the upper and lower ends of the range.
subset with outliers removed:
## [1] 4698 12
200 outliers were removed from the original data. This leaves 4698 observations with quality scores only between 4 and 7 in the new subset.
original:
subset:
In the original data, the fixed.acidity histogram looks approximately normal with a slight right skew. The boxplot shows a few outliers on the upper and lower ends of the range.
The subset data isn’t much different, since the only rows that have been removed are the outliers in the quality variable. I will now remove the fixed.acidity outliers in the subset dataframe.
subset with outliers removed:
## [1] 4560 12
138 outliers were removed, and fixed.acidity is now approximately normally distributed.
original:
subset:
In the original data, volatile.acidity is bell-shaped and roughly symmetric around the peak, but the upper tail makes it slightly right-skewed. There are several outliers in the upper end of the range.
The subset data shows almost the same as the original data. I’ll now remove the volatile.acidity outliers in the subset dataframe.
## [1] 4398 12
162 outliers were removed. The distribution is roughly symmetric around the peak and still slightly right-skewed, but mostly normal.
original:
subset:
In the original dataset, the citric acid distribution is slightly right-skewed, with a small spike on the upper shoulder of the curve. The boxplot shows outliers in the upper and lower ends of the range.
The subset data is close to the original dataset pattern.
## [1] 4138 12
260 outliers removed and the histogram shows that the density peak is flattened out a bit. The small spike is still apparent in the upper range of the shoulder of the curve. The boxplot shows that there are some new outliers, but I will leave these in the data, as I expect they are from the small data spike.
original:
subset:
In the original data, residual sugar distribution is very right-skewed with a high number of observations on the lower end of the range. The boxplot shows that there are outliers in the upper end.
In the subset data, some of the upper-end outliers have already been removed from previous outlier removal. The data is still heavy on the lower end of the range followed by a more uniform spread of the higher values.
## [1] 4130 12
8 outliers were removed and the data is right-skewed.
original:
subset:
Both the original dataset and the subset show that chlorides have a mostly normal distribution, but with outliers in the upper and lower ends of the range.
## [1] 3985 12
145 outliers removed and the data has a better, normal distribution.
original:
subset:
Both the original dataset and the subset show that free sulfur dioxide has a mostly normal distribution with outliers in the upper end.
## [1] 3942 12
43 outliers removed and the distribution is normal.
original:
subset:
In the original dataset, total sulfur dioxide has a wide distribution with outliers in the upper and lower ends of the range.
The subset data is more normalized but with some outliers in the upper end of the range.
## [1] 3939 12
3 outliers were removed, and total sulfur dioxide has a normal distribution.
original:
subset:
In the original dataset, density distribution is mostly normal, slightly right-skewed and has outliers in the upper end of the range.
The subset data is more normalized, slightly right-skewed, and with no outliers shown in the boxplot. For consistency reasons, I will still perform the outlier calculations on the subset.
## [1] 3939 12
As I suspected, no outliers were removed after the calculations.
original:
subset:
In both the original dataset and the subset, pH has a mostly symmetrical, wide distribution with outliers on the upper and lower ends of the range.
## [1] 3885 12
54 outliers were removed and the distribution is normal.
original:
subset:
In both the original dataset and the subset, sulphates have a distribution that is slightly right-skewed, with outliers in the upper end of the range.
## [1] 3797 12
88 outliers removed and the data is bimodal, and slightly right-skewed.
original:
subset:
In both the original dataset and the subset, alcohol has a right-skewed, wide, and somewhat uniform distribution with no outliers. However, as I did above, I will perform the outlier calculation for consistency.
## [1] 3797 12
As suspected, no outliers were removed after the calculations.
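Because each variable is filtered on the subset left by the previous step, the order of the steps affects how many rows each one removes. The sketch below shows that bookkeeping on made-up data; `remove_iqr_outliers` is the same hypothetical helper idea as before, repeated here so the block runs on its own.

```r
# Hypothetical 1.5 x IQR filter (an assumption, not the original code).
remove_iqr_outliers <- function(df, col, k = 1.5) {
  x <- df[[col]]
  q <- quantile(x, probs = c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  iqr <- q[2] - q[1]
  df[!is.na(x) & x >= q[1] - k * iqr & x <= q[2] + k * iqr, , drop = FALSE]
}

# Made-up data: the last row is extreme in both columns.
toy <- data.frame(
  quality = c(3, 5, 5, 6, 6, 6, 6, 6, 7, 9),
  alcohol = c(9, 9.5, 10, 10, 10.5, 11, 11, 11.5, 12, 30)
)

# Filter one variable after another and record the rows removed each time.
wine_sub <- toy
removed <- integer(0)
for (v in names(toy)) {
  before <- nrow(wine_sub)
  wine_sub <- remove_iqr_outliers(wine_sub, v)
  removed[v] <- before - nrow(wine_sub)
}
removed

# For comparison: filtering alcohol on the untouched data.
alcohol_alone <- nrow(toy) - nrow(remove_iqr_outliers(toy, "alcohol"))
alcohol_alone
```

In this toy case the extreme alcohol value disappears during the quality step, so the later alcohol step removes nothing. That is the same effect noted earlier for residual sugar, where some upper-end outliers had already gone by the time that variable was filtered.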
I think that the fastest way to identify correlations between variables would be to create a pairs plot.
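A pairs plot can be sketched with base R's `pairs()`, and `cor()` gives the matching matrix of coefficients. Which plotting function produced the original figure is not shown, so this is only an assumption; the data are the ten preview values of three columns from the structure output, far too few to reproduce the scores reported in this section.

```r
# Ten-row preview of three columns, copied from the str() output above.
wine_preview <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11),
  pH             = c(3, 3.3, 3.26, 3.19, 3.19, 3.26, 3.18, 3, 3.3, 3.22)
)

# Scatterplot matrix: one panel per pair of variables.
pairs(wine_preview)

# Numeric counterpart: pairwise Pearson correlations, rounded for reading.
cor_mat <- round(cor(wine_preview), 2)
cor_mat
```

Scanning the correlation matrix for the largest absolute values is a quick way to pick the pairs that deserve their own scatterplot.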
Based on the plot above, there are a few pairs of variables for which I'd like to create individual correlation plots to get a better view.
## [1] 0.8474419
Density and residual sugar have the strongest correlation, with a score of 0.85. The plot confirms this: the points follow a tight, positive relationship.
## [1] 0.6210167
Free sulfur dioxide and total sulfur dioxide have the next strongest correlation with a score of 0.62. Even though free sulfur dioxide is a continuous variable, there does seem to be a slight pattern of discrete grouping in the way the variable is recorded.
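The single-number scores printed in this section look like Pearson coefficients as returned by `cor()`; that is an assumption, since the code is not shown. The sketch below computes the coefficient both with `cor()` and from its definition, using the ten preview values of the two sulfur dioxide columns from the structure output. With only ten rows, the result will not match the 0.62 reported for the filtered subset.

```r
# Ten preview values of each column, copied from the str() output above.
free_so2  <- c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28)
total_so2 <- c(170, 132, 97, 186, 186, 97, 136, 170, 132, 129)

# Built-in Pearson correlation (method = "pearson" is the default).
r_builtin <- cor(free_so2, total_so2)

# Same quantity from the definition: the sum of products of the
# centred values divided by the product of their root sums of squares.
dx <- free_so2 - mean(free_so2)
dy <- total_so2 - mean(total_so2)
r_manual <- sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))

all.equal(r_builtin, r_manual)
```

Pearson's coefficient measures only linear association, which is why pairing each score with its scatterplot is useful.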
## [1] 0.554056
Density and total sulfur dioxide have a correlation score of 0.55. The plot confirms a positive correlation between these two variables.
## [1] 0.5035444
Density and chlorides have a correlation score of 0.50. This is also confirmed with the positive correlation plot between these two variables.
## [1] -0.8195218
Density vs alcohol is my strongest negative correlation with a score of -0.82. This is confirmed in the plot with a strong negative correlation.
## [1] 0.413995
This plot shows that most of the alcohol observations fall in quality score 6, and that alcohol and quality have an overall positive correlation of 0.41.
Next, I'd like to add quality to some of the plots above and see how the correlations differ by quality score.
Here, I can see that the quality score of 5 has the strongest correlation between residual.sugar and density.
total.sulfur.dioxide and free.sulfur.dioxide have an almost uniform positive correlation with every unique quality observation with 5 being the strongest.
total.sulfur.dioxide and density also have a positive correlation with every unique quality observation. The quality score of 4 stands out; however, the quality score of 5 seems to be the strongest.
Much like the plot above, residual.sugar and total.sulfur.dioxide have a positive correlation with every unique quality observation. The quality score of 4 stands out; however, the quality score of 7 seems to have the strongest positive correlation path.
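The per-quality comparisons above are read from the plots. A numeric check could be sketched as below, by splitting the data on quality and computing one correlation per group. The data are made up for illustration (the structure preview only contains quality 6), and the names `toy` and `by_quality` are hypothetical.

```r
# Made-up data: two quality groups with different sugar/density tightness.
toy <- data.frame(
  quality        = c(5, 5, 5, 5, 6, 6, 6, 6),
  residual.sugar = c(1, 4, 8, 12, 1, 4, 8, 12),
  density        = c(0.991, 0.993, 0.996, 0.999,
                     0.992, 0.991, 0.995, 0.994)
)

# One Pearson correlation per quality score.
by_quality <- sapply(
  split(toy, toy$quality),
  function(d) cor(d$residual.sugar, d$density)
)
round(by_quality, 2)

# Quality score with the strongest positive relationship.
names(which.max(by_quality))
```

Groups with very few observations can give unstable coefficients, so the group sizes are worth checking alongside these values.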
This plot shows that as alcohol content decreases and density increases, the quality drops.
I chose alcohol content as my first plot because I found it interesting how wide and somewhat uniform the distribution is compared to all the other variables.
I chose this as my second plot to show how there is a strong positive correlation between density and residual.sugar, yet a strong negative correlation between density and alcohol.
I chose this plot to reflect the strong positive correlation between residual.sugar and density with quality factored in the point colors and correlation lines.
From this analysis, I found that better-tasting white wine is low in density and low in residual sugar. I was also surprised to discover that a higher alcohol by volume seemed to improve the quality score only by perception: apart from its strong negative correlation with density (-0.82), alcohol did not correlate closely with the other chemical variables. This makes me wonder whether the taste testers knew, and were influenced by, information about each of the white wines they tested, affecting the quality score.
One of my struggles was sticking to the idea of a “quick and dirty” analysis. Through much of this project, I wanted to make beautiful-looking plots, so I wasted a lot of time researching plot aesthetics instead of analyzing the data. Once I realized what I was doing and that I needed to keep it “quick and dirty”, the analysis and plotting came more easily. However, after my revision, most of my plots are not “quick and dirty”.
I do think this dataset could be used in machine learning to build predictive models of which physicochemical properties influence the taste of white wine. However, it should be used alongside many other similar datasets from controlled taste-testing events. It should also be noted that the sensory quality score of a wine would most likely vary between regions of the world. What is considered a great-tasting wine in one region may be considered the complete opposite in another.
The dataset used in this analysis was downloaded from:
For more information regarding this dataset, please visit these sites [Cortez et al., 2009]: